Code
library(Seurat)
library(targets)
library(clustree)
library(dplyr)
tar_make_future()Data preprocessing and QC
Here is some general information about the project and analysis. It can be edited in the file _metadata.yml.
library(Seurat)
library(targets)
library(clustree)
library(dplyr)
tar_make_future()library(ggplot2)
library(cowplot)
source("R/themes.R")This is the study description. It can be edited in the file _metadata.yml.
tar_load(obj_orig_meta)
tar_load(qc_groupby)Original cell counts per sample:
knitr::kable(as.data.frame(table(obj_orig_meta[[tar_read(qc_groupby)]])), col.names = c("sample","no of cells"))| sample | no of cells |
|---|---|
| Sample1 | 1500 |
| Sample2 | 1500 |
cellcounts = as.data.frame(table(obj_orig_meta[[tar_read(qc_groupby)]]))
cellcounts$Group = obj_orig_meta[,qc_groupby][match(cellcounts$Var1, obj_orig_meta[[tar_read(qc_groupby)]])]
ggplot(cellcounts, aes(x = Group, y = Freq)) + geom_jitter(width = 0.1) + lims(y = c(0,NA)) +
labs(x = "", y = "Cell count")We use four variables to compare the quality of each sample: Number of unique genes (“nFeature_RNA”), number of UMI counts (“nCounts_RNA”), % of mitochondrial RNA in the sample (“percent.mito”) and % of ribosomal RNA in the sample (“percent.ribo”). The quality control metrics are shown for each sample before and after filtering. In the unfiltered data, the dashed lines indicate thresholds that were used to filter the data. The thresholds used were the following:
tar_load(qc_filt_rules)
thresholds = data.frame(
Parameter = sapply(qc_filt_rules, function(x){
if(length(x)>3){
x[[4]]
}else{
paste(x[[1]], x[[2]])
}
}),
Value = sapply(qc_filt_rules, "[[", 3))| Parameter | Value |
|---|---|
| Min number of unique features per cell | 2e+02 |
| Max number of reads (unique UMIs) per cell | 1e+06 |
| Max % mitochondrial reads per cell | 2e+01 |
| Min % ribosomal reads per cell | 5e+00 |
tar_read(qc_unfiltered_vln)After the first filtering, the distribution of QC variables are as shown:
tar_read(qc_filtered_vln)In addition to filtering genes based on the number of cells expressing the gene, individual genes can be removed for reasons such as expected technical artefacts. To identify genes that may introduce biases, we visualize the genes with the highest median expression across cells.
tar_read(top_expr_boxplot)The removed genes contain the following patterns:
genes_to_remove = sapply(tar_read(qc_genesToRemove),
function(g){
paste0("^",g,"$")
})
patterns = c(tar_read(qc_genePatternsToRemove), genes_to_remove)
if(length(patterns)>0){
remove_df =
knitr::kable(data.frame(
Pattern = gsub("\\^","", gsub("\\$","",patterns)),
Rules =
sapply(patterns, function(x){
rules = c()
if(substr(x, 1, 1) == "^"){
rules = c(rules, "Begins with")
}
if(substr(x, nchar(x), nchar(x)) == "$"){
rules = c(rules, "Ends with")
}
paste0(rules, collapse = " & ")
})))
}else{
remove_df = "*No genes were selected for removal*"
}| Pattern | Rules |
|---|---|
| RP[SL] | Begins with |
| MT- | Begins with |
| MALAT1 | Begins with & Ends with |
In order to estimate whether there are differences between samples considering the cell cycle state at time of collection, and whether this needs to be accounted for downstream, “cell cycle scores” is added. This is an average of scaled expression of a list of genes indicative of the S phase and the G2-M phase in the cell cycle.
To specify which genes should be used for S phase and G2/M phase scores, edit the files parameters/cc_S_genes.csv and parameters/cc_G2M_genes.csv (note: these .csv-files should not contain column names). If no genes are indicated in these files, the default gene lists given by Seurat will be used (see Seurat::cc.genes.updated.2019$s.genes and Seurat::cc.genes.updated.2019$g2m.genes). Note that these default lists concern human genes.
tar_read(cc_violin)DFmessage = "Doublet removal was not performed."
if(tar_read(qc_runDF)){
DFmessage = 'In droplet-based sequencing, a number of droplets containing more than one cell is expected. In order to predict which "cells" in the dataset are actually composed of more than one cell, the package [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) is used. Doublet cells are then removed from the dataset.
To evaluate whether the doublet prediction worked well, we look at two plots:
1. A preliminary UMAP in which doublets are indicated (doublets are likely to be found between clusters)
2. A violin plot showing the distribution of number of unique features per cell, split by predicted doublets and singlets in each sample (on average, doublets have a higher number of unique genes)'
}In droplet-based sequencing, a number of droplets containing more than one cell is expected. In order to predict which “cells” in the dataset are actually composed of more than one cell, the package DoubletFinder is used. Doublet cells are then removed from the dataset.
To evaluate whether the doublet prediction worked well, we look at two plots:
A preliminary UMAP in which doublets are indicated (doublets are likely to be found between clusters)
A violin plot showing the distribution of number of unique features per cell, split by predicted doublets and singlets in each sample (on average, doublets have a higher number of unique genes)
if(tar_read(qc_runDF)){
tar_read(doublet_plots)$dimplot & labs(title = "Doublets")
}if(tar_read(qc_runDF)){
tar_read(doublet_plots)$vlnplot
}The final number of cells are as follows:
tar_load(obj_filt_final_meta)knitr::kable(as.data.frame(table(obj_filt_final_meta[[tar_read(qc_groupby)]])), col.names = c("sample","no of cells"))| sample | no of cells |
|---|---|
| Sample1 | 828 |
| Sample2 | 552 |
cellcounts = as.data.frame(table(obj_filt_final_meta[[tar_read(qc_groupby)]]))
cellcounts$Group = obj_filt_final_meta[,qc_groupby][match(cellcounts$Var1, obj_filt_final_meta[[tar_read(qc_groupby)]])]
ggplot(cellcounts, aes(x = Group, y = Freq)) + geom_jitter(width = 0.1) + lims(y = c(0,NA)) +
labs(x = "", y = "Cell count")The distributions of the QC variables are also shown.
tar_read(qc_final_vln)if(tar_read(dimred_sct)){
normalization_method = paste0("Normalization was performed using the SCTransform method on the top ",
tar_read(dimred_nHVG)," most variable features.")
if(length(tar_read(dimred_varstoregress))>0){
normalization_method = paste0(normalization_method, " The following variables were regressed out: ",
paste0(tar_read(dimred_varstoregress), collapse = ", "),".")
}
}else{
normalization_method = "Normalization was performed using the NormalizeData method."
if(length(tar_read(dimred_varstoregress))>0){
normalization_method = paste0(normalization_method, " The following variables were regressed out when scaling the top ",
tar_read(dimred_nHVG)," most variable features: ",
paste0(tar_read(dimred_varstoregress), collapse = ", "),".")
}
}
if(!tar_read(load_joinlayers)){
HVGmessage = "*Note: This analysis uses the feature of layers introduced in Seurat v5. A feature of this functionality is that variance for each gene is calculated on each original sample individually, and the variability for each gene is calculated based on the ranks in each sample. This means that unlike previous Seurat versions, the genes with the most variability in the plot are not necessarily the most variable when combining the ranked lists.*"
}else{
HVGmessage = ""
}Normalization was performed using the NormalizeData method. The following variables were regressed out when scaling the top 2000 most variable features: nFeature_RNA.
tar_load(obj_pca_umap)Dimensional reduction was performed using PCA. To do this, the top 2000 most variable features were used. They are visualized below (in red), showing the mean expression and standardized variance for each gene. The top 20 genes are labeled.
top20 <- head(VariableFeatures(obj_pca_umap), 20)
LabelPoints(plot = VariableFeaturePlot(obj_pca_umap), points = top20, repel = TRUE)The top variable genes are used to produce an initial PCA.
addColors(DimPlot(obj_pca_umap, reduction = "pca", shuffle = TRUE), layer = 1)The top genes contributing positively or negatively to the first 4 principal components are as shown:
VizDimLoadings(obj_pca_umap, dims = 1:4, nfeatures = 20, balanced = TRUE, ncol = 4) &
geom_vline(xintercept = 0, linetype = 2, color = "grey") &
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))The variance explained by each PC helps us estimate how many PCs will be useful to include in order to retain maximum variance while reducong noise. An “elbow plot” is useful for this.
ElbowPlot(obj_pca_umap, ndims = 50)A UMAP was produced to visualize the relative similarities between cells. For this, the following parameters were set:
shortenvector = function(x){
if(length(x)>1){
if(is.numeric(x) | is.integer(x)){
if(identical(as.numeric(x), as.numeric(min(x):max(x)))){
x = paste0(min(x), ":", max(x))
}
}
x = paste0(x, collapse = ", ")
}
x
}
umap_params = data.frame(
Parameter = c("Number of neighbors",
"Number of components to produce",
"Minimum distance",
"Number of PCs to consider"),
Value = c(
shortenvector(tar_read(dimred_umap_nn)),
shortenvector(tar_read(dimred_umap_ncomp)),
shortenvector(tar_read(dimred_umap_mindist)),
shortenvector(tar_read(dimred_umap_dims))
)
)| Parameter | Value |
|---|---|
| Number of neighbors | 30 |
| Number of components to produce | 2 |
| Minimum distance | 0.3 |
| Number of PCs to consider | 1:20 |
addColors(DimPlot(obj_pca_umap, shuffle = TRUE), layer = 1)FeaturePlot(obj_pca_umap, c("nFeature_RNA","nCount_RNA","percent.mito", "percent.ribo","percent.hb"), ncol = 2) &
theme(plot.title = element_text(size = 12),
legend.text = element_text(size =10),
axis.text = element_blank(),
axis.ticks = element_blank())rm(obj_pca_umap)
gc()tar_load(obj_int_umap)reductions = c(
"pca_nonint","umap_nonint",
ifelse(tar_read(load_joinlayers), "pca","integrated.cca"),
"umap"
)
plot_grid(
plotlist =
lapply(reductions, function(red){
addColors(DimPlot(obj_int_umap, reduction = red, shuffle = TRUE) &
NoLegend() &
theme(axis.text = element_blank(), axis.title = element_blank(),
axis.ticks = element_blank()) &
labs(title = red), layer = 1)
}))addColors(DimPlot(obj_int_umap, reduction = "umap", shuffle = TRUE,
split.by = tar_read(int_split)), layer = 1)rm(obj_int_umap)
gc()tar_load(obj_clus)clustername = paste0("clusters_",tar_read(clus_reduction),"_")
clustering = colnames(obj_clus@meta.data)[grepl(clustername, colnames(obj_clus@meta.data))]
plot_grid(
plotlist =
lapply(clustering, function(clus){
shadow_text(addColors(DimPlot(obj_clus, reduction = "umap", shuffle = TRUE,
group.by = clus, label = TRUE) &
NoLegend() &
theme(axis.text = element_blank(), axis.title = element_blank(),
axis.ticks = element_blank()) &
labs(title = clus),
scale_type = "colour", layer = 1))
}))clustree(obj_clus, prefix = clustername)cluster_res = paste0("clusters_",
tar_read(clus_reduction),
"_",
tar_read(clus_de_res))For annotation purposes and comparison between samples, a clustering resolution is chosen. For this report, the cluster of interest is cluster_res.
obj_clus$Cluster = factor(obj_clus@meta.data[[cluster_res]])
addColors(ggplot(obj_clus@meta.data,
aes(x = Sample)) +
geom_bar(aes(fill = Cluster),
position = "fill"),
"fill", layer = 1)addColors(ggplot(obj_clus@meta.data,
aes(x = Cluster)) +
geom_bar(aes(fill = Sample), position = "fill"),
scale_type = "fill", layer = 1)tar_read(cluster_de_barplot)addColors(VlnPlot(obj_clus,
features = tar_read(qc_plotvars),
group.by = "Cluster",
pt.size = 0, ncol = 2) & violintheme(),
scale_type = "fill")addColors(VlnPlot(obj_clus,
features = c("S.Score","G2M.Score"),
group.by = "Cluster",
pt.size = 0, ncol = 2) & violintheme(),
scale_type = "fill")top3 = unique(tar_read(cluster_de)$de_res %>%
group_by(cluster) %>%
slice_min(order_by = p_val_adj, n = 3) %>%
pull(gene))
addColors(VlnPlot(obj_clus, group.by = "Cluster",
features = top3, pt.size = 0, assay = "RNA") &
violintheme(), scale_type = "fill")knitr::kable(targets::tar_meta(fields = warnings, complete_only = TRUE))| name | warnings |
|---|---|
| cc_violin | Subsetting quosures with is deprecated as of rlang 0.4.0Please use quo_get_expr instead.This warning is displayed once every 8 hours. |
| doublet_plots | The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session |
| metadata | incomplete final line found by readTableHeader on datametadata.csv |
| obj_cc | The following features are not present in the object RAD51, not searching for symbol synonyms. The following features are not present in the object PIMREG, CDC25C, not searching for symbol synonyms |
| obj_df | The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session. UNRELIABLE VALUE One of the future.apply iterations future_lapply1 unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify future.seedTRUE. This ensures that proper, parallelsafe random numbers are produced via the LEcuyerCMRG method. To disable this check, use future.seed NULL, or set option future.rng.onMisuse to ignore. |
| obj_filt | No layers found matching search pattern provided. data layer is not found and counts layer is used |
| obj_int | Different features in new layer data than already exists for scale.data. Different features in new layer data than already exists for scale.data. Layer counts isnt present in the assay object returning NULL |
| obj_int_umap | The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session |
| obj_pca_umap | The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session |
sessionInfo()R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS/LAPACK: /Users/jenfra/miniconda3/envs/seurat5/lib/libopenblasp-r0.3.28.dylib; LAPACK version 3.12.0
locale:
[1] C/UTF-8/C/C/C/C
time zone: Europe/Stockholm
tzcode source: system (macOS)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] shadowtext_0.1.4 cowplot_1.1.3 dplyr_1.1.4 clustree_0.5.1
[5] ggraph_2.2.1 ggplot2_3.5.1 targets_1.9.1 Seurat_5.1.0
[9] SeuratObject_5.0.2 sp_2.1-4
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 jsonlite_1.8.9 magrittr_2.0.3
[4] ggbeeswarm_0.7.2 spatstat.utils_3.1-0 farver_2.1.2
[7] rmarkdown_2.28 vctrs_0.6.5 ROCR_1.0-11
[10] memoise_2.0.1 spatstat.explore_3.2-6 htmltools_0.5.8.1
[13] sctransform_0.4.1 parallelly_1.38.0 KernSmooth_2.23-24
[16] htmlwidgets_1.6.4 ica_1.0-3 plyr_1.8.9
[19] cachem_1.1.0 plotly_4.10.4 zoo_1.8-12
[22] igraph_2.0.3 mime_0.12 lifecycle_1.0.4
[25] pkgconfig_2.0.3 Matrix_1.6-5 R6_2.5.1
[28] fastmap_1.2.0 fitdistrplus_1.2-1 future_1.34.0
[31] shiny_1.9.1 digest_0.6.37 colorspace_2.1-1
[34] patchwork_1.3.0 ps_1.8.0 tensor_1.5
[37] RSpectra_0.16-2 irlba_2.3.5.1 base64url_1.4
[40] labeling_0.4.3 progressr_0.14.0 fansi_1.0.6
[43] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7
[46] abind_1.4-5 compiler_4.3.3 withr_3.0.1
[49] backports_1.5.0 viridis_0.6.5 fastDummies_1.7.4
[52] ggforce_0.4.2 MASS_7.3-60.0.1 tools_4.3.3
[55] vipor_0.4.7 lmtest_0.9-40 beeswarm_0.4.0
[58] httpuv_1.6.15 future.apply_1.11.2 goftest_1.2-3
[61] glue_1.8.0 callr_3.7.6 nlme_3.1-165
[64] promises_1.3.0 grid_4.3.3 checkmate_2.3.2
[67] Rtsne_0.17 cluster_2.1.6 reshape2_1.4.4
[70] generics_0.1.3 gtable_0.3.5 spatstat.data_3.1-2
[73] tidyr_1.3.1 data.table_1.15.4 tidygraph_1.3.0
[76] utf8_1.2.4 spatstat.geom_3.2-9 RcppAnnoy_0.0.22
[79] ggrepel_0.9.6 RANN_2.6.2 pillar_1.9.0
[82] stringr_1.5.1 spam_2.11-0 RcppHNSW_0.6.0
[85] later_1.3.2 splines_4.3.3 tweenr_2.0.3
[88] lattice_0.22-6 survival_3.7-0 deldir_2.0-4
[91] tidyselect_1.2.1 miniUI_0.1.1.1 pbapply_1.7-2
[94] knitr_1.48 gridExtra_2.3 scattermore_1.2
[97] xfun_0.48 graphlayouts_1.2.0 matrixStats_1.4.1
[100] stringi_1.8.4 lazyeval_0.2.2 yaml_2.3.10
[103] evaluate_1.0.1 codetools_0.2-20 tibble_3.2.1
[106] cli_3.6.3 uwot_0.1.16 secretbase_1.0.3
[109] xtable_1.8-4 reticulate_1.39.0 munsell_0.5.1
[112] processx_3.8.4 Rcpp_1.0.13 globals_0.16.3
[115] spatstat.random_3.2-3 png_0.1-8 ggrastr_1.0.2
[118] parallel_4.3.3 dotCall64_1.2 listenv_0.9.1
[121] viridisLite_0.4.2 scales_1.3.0 ggridges_0.5.6
[124] leiden_0.4.3.1 purrr_1.0.2 rlang_1.1.4